nvidia-drm: handle -EDEADLK in nv_drm_reset_input_colorspace #1031
jopamo wants to merge 1 commit into NVIDIA:main
Conversation
Thanks for the patch and backtrace. What are the steps you're using, or particular configuration, to trigger the problem?
GPU/driver: 5060 Ti, 590.48.01. This happened during boot while I was debugging an unrelated kernel module (act_gate). I wasn't actively exercising the DRM stack; the warning showed up as part of early system bring-up with heavy lock debugging enabled. From what I can tell, the relevant locking contract in the kernel source is very explicit about how -EDEADLK must be handled.

Given the above, it looks like the atomic path needs to follow the standard ww-mutex backoff/retry sequence when -EDEADLK is returned. I haven't had stability issues in conjunction with this, but this doesn't appear to be a false positive based on the kernel docs.
I think we would need to explore Wound/Wait Deadlock Prevention conceptually in more depth before being able to approach any sort of fix. My take is that the issue only reproduces with kernel lock debugging enabled, and we have not seen any live occurrences of it that we are currently aware of. nvidia-drm does not make use of TTM or any of the upstream GPU resource managers, so it could be that the design difference is falsely triggering the deadlock detector.
The fact that this only reproduces with lock debugging enabled is expected: lockdep is designed to expose latent ordering bugs that depend on timing, and not seeing a production deadlock does not establish correctness. The atomic helpers assume drivers implement the documented backoff pattern. As for TTM, its absence is not directly relevant. The ww_mutex and modeset locking rules apply to any driver using the DRM atomic helpers regardless of memory manager; the expectations come from DRM core, not TTM. If anything, diverging from common upstream patterns makes strict adherence more important. This is not a lockdep false positive due to design differences. It is a missing retry path in a ww_mutex context, and lock debugging simply makes it visible.
Took a closer look. The GSP-RM path still leads to GSP RPC timeouts (Xid 119). In my testing the DRM-side error path is effectively masked because GSP wedges first. I can reproduce a GPU hang/reset by racing DRM atomic commits with DROP_MASTER/SET_MASTER. The failure manifests as a GSP RPC timeout: fn 76 (GSP_RM_CONTROL) data0=0x20800a6a data1=0x0, followed by Xid 62/109/119 and “GPU reset required”. Adding a small delay between DROP_MASTER and SET_MASTER reduces the reproduction rate which suggests a tight timing window.
nevermind, I can hit both with lock debugging off.
Thanks, I took some time to read through core DRM more to understand this change. The documentation for drm_modeset_acquire_ctx makes the backoff/retry contract clear. We need to go through every usage of the modeset locks in nvidia-drm to make sure the pattern is followed consistently.
Tracking this effort in NVBug 5930847.
Force-pushed 2879a61 to f66cb86
Thanks. I had expanded beyond the original commit, but I missed one of the sites and didn't fully understand the interaction with the GSP. So far in testing with the updated patch, I can no longer reproduce the issue.
Thanks a lot @jopamo. I think the commit description would need to be updated as well; would you mind referencing https://docs.kernel.org/gpu/drm-kms.html#c.drm_modeset_acquire_ctx there? In general, I am not sure we want so many jumping labels vs. a retry loop.
Right, I actually have not gotten down to tracing the GSP interaction and am not entirely comfortable with that relationship. What I did was take a moment to go through core DRM to understand the locking implications, so I could better understand the warning you were seeing and the intent of your change. Once I conceptually understood the details behind the ww-mutex backoff contract, the rest followed.

If you can repro the issue with the GSP traces and collect the logs, that would help.
Yes, I can make those changes. I already sent the repro by email. The issue is a local GPU DoS (wedge/reset-required) triggered by a timing-sensitive KMS/DRM master handoff + atomic activity race. It's not safe.
Sorry for the confusion. The GSP bug still exists separately. My system locks hard, no pulling logs beyond this.
Would you mind sharing the subject of the email and To: address? Was this sent to linux-bugs@nvidia.com? Sorry for needing to cross-reference here.
This matches my expectation; I wouldn't expect the GSP issue to be related to DRM's locking safety semantics. It's good to have them tracked as separate issues.
My fault, I was intentionally vague because I didn't (and still don't) understand the full impact. When I said "hit both", I meant two distinct bugs, and I followed up immediately with a more detailed email conveying that I do not seem to be able to affect the GSP issue from any changes on the open source side, so I left it alone. I thought your follow-up was a part of this. PSIRT ticket #5916985.
Thanks, let me sync with our internal teams so I can see what you reported in that PSIRT ticket. I usually just scan the GitHub repos in my free time to make sure we do not miss anything.
drm_atomic_get_plane_state() and drm_atomic_commit() can return -EDEADLK when ww-mutex deadlock avoidance triggers. The retry contract is tied to struct drm_modeset_acquire_ctx: if an operation with an acquire context returns -EDEADLK, callers must run drm_modeset_backoff() and retry the full atomic sequence with the same context. Reference: https://docs.kernel.org/gpu/drm-kms.html#c.drm_modeset_acquire_ctx
Force-pushed f66cb86 to 4642624
drm_atomic_get_plane_state() and drm_atomic_commit() can return -EDEADLK when ww-mutex deadlock avoidance triggers. The current
nv_drm_reset_input_colorspace() path drops locks and returns without running the required modeset backoff/retry flow.
Rework the function to retry the atomic sequence with drm_modeset_backoff(&ctx), rebuilding atomic state on each retry, and only finish once the sequence succeeds or another error is returned.